Abstract



Workloads generate a variety of disk I/O requests to access file information, execute
programs, and perform computation. I/O caches capture many of these requests, reducing
execution time, providing high I/O rates, and decreasing the disk bandwidth needed by each
workload. Workload component characterization shows that file type and size information can be
used to group requests with similar reuse rates and access patterns.

Attribute caches have several partitions to capture the statistically distinct component
behavior of the workload, each tailored to cache files with certain properties or attributes.
Information about an I/O request becomes an attribute that determines how best to cache the request.
Using attributes, cache resources are allocated to capture specific types of I/O data locality.

The paper develops an attribute cache scheme to improve total I/O cache performance. The 
scheme relies on workload characteristics to determine the appropriate cache configuration for a 
given cache size. For a set of eleven measured workloads, it reduced the miss ratio by 25-60%,
depending on cache size, and required only about 1/8 as much memory as a typical I/O cache
implementation achieving the same miss ratio.
1 Introduction



The performance of an I/O cache depends on its ability to capture workload locality. An effective cache 
closely matches workload needs with cache size, configuration and data management policies. Although 
the basic techniques for selecting the proper I/O cache size and configuration are known, designers rarely 
use them to select or tune I/O caches. The task of tuning an I/O cache is daunting, especially given that the 
cache must function under a wide range of workload environments and system memory configurations. 

Several studies characterize the I/O workload and show that I/O caches reduce I/O traffic in a distributed 
file system and provide reasonable system performance [9, 2, 8, 12]. The characterizations show that the 
majority of file accesses are to small files of less than 10 Kbytes, that files tend to be accessed sequentially,
and that the majority of bytes transferred reside in large files. Over time the size of large files has increased, 
and the length of sequential runs has increased. Files and sequential runs of larger than one megabyte are 
common. The individual data request size is typically determined by the system libraries so long runs are 
made up of many individual requests. 

In Unix systems disk files can be classified as accesses to inodes, directories, datafiles or executables. 
Each has distinct cache behavior [12]. Inodes and directories are small and highly reused files, while 
datafiles and executable files have more diverse characteristics. The smaller ones exhibit moderate reuse 
and have little sequential access, while the larger files tend to be accessed sequentially and infrequently 
reused. Properly used, file type and file size information improves cache performance. 

Attribute caches use directives in the form of file attributes. They improve cache performance by more 
closely matching the cache with expected workload behavior. Uniform cache schemes try to best capture
the access behavior of the entire workload. The resulting cache does capture locality, but designing to the 
statistical properties of the entire workload limits its effectiveness. 

Attributes indicate files with similar cache behavior. Attribute caches hold individual data
requests more efficiently because each request has a narrower range of expected behavior.
Attribute caches differ from attribute caching: attribute caches use attributes to guide data
management, whereas attribute caching refers to storing file attributes in a cache [14].

The remainder of this paper focuses on attribute caches. Section 2 describes the workloads and the
attribute cache terminology. Section 3 examines attribute cache design trade-offs in various regions of
operation. Section 4 describes and evaluates the performance of an attribute cache scheme that varies with
cache size to substantially improve I/O cache performance. Section 5 concludes the paper. Two appendices
describe the trace collection and simulation techniques and the workload component behavior in I/O caches.

2 Background 
2.1 Workloads 
Kernel Build trace is a configuration and compile of the ULTRIX kernel (2 hours).

Ingres Transactions performs 10,000 banking debit and credit transactions on an Ingres database. The transactions are entirely 
random, and there are no complicated searches. 

Application Data Analysis manipulates a series of traces, simulates caches and displays cache results. 

Software Development 1 (8 to 5) develops and tests the ATOM simulation system. The primary system user turned tracing on 
when they arrived, and off when they left for the day. 

Software Development 2 (24 hr.) is the same basic environment as Software Development 1 (8 to 5). The trace includes idle time
at night and all the maintenance activity that goes on at night (4 days).

Document Preparation records work on a technical report which includes text processing, editing, simulation, drawing, and data 
manipulation (6 hours).

Mecca Development (24 hr.) records the development and testing of a centralized e-mail system.

Mecca Development with Server (8 to 5) is the same basic environment as Mecca Development (24 hr.), except that (1) the 
Ingres server used to direct mail resides on the traced machine, and (2) the traces were collected only during the day. 

CAD: Chip Build traces the construction of a CPU layout from a high level description. The workload constructed the layout of 
the chip and ran design tests on the compiled chip description. 

Network Update (24 hr.) mostly consists of a large network gather-scatter operation. The operation gathered information from 
the whole DEC NET and then updated the net with new information. 

Compute Server (24 hr.) consists of batch simulations running on a machine with an idle console. 

Figure 1 : Eleven traced workload environments 

I/O cache performance evaluation requires collecting I/O traces or installing instrumentation. Traces are 
needed by all designers to simulate and directly compare various cache alternatives and understand workload 
locality. Generating traces is difficult, and the type of tracing or instrumentation determines which I/O 
characteristics are visible [15]. Some I/O cache studies have used disk requests to study cache performance 
[15, 10, 11], while others have used operating system traces of file system activity [9, 8, 3, 13, 6, 12].

Figure 1 describes the eleven workloads evaluated in this paper. Most of the workloads cover several 
days of user activity. These long traces capture significant I/O activity and show the interaction of the many 
large and small files that comprise a workload. 

The workload traces cover a wide range of user applications and types of work. Although not necessarily
typical of heavy commercial use, such as large database systems, they represent many engineering
development and office environments. Appendix A discusses the trace collection methodology and Appendix B
shows workload component behavior. More detailed descriptions of the workloads are included in [12].

2.2 Attribute Cache Framework 

Attribute Cache: An I/O cache that uses file information to choose the cache strategy for I/O requests
associated with each file. 

Attribute: An attribute indicates the expected cache behavior of a file. Attributes may be known features 
of a file, or they may be explicitly assigned to the file. A file may have multiple attributes that define 
its expected cache behavior. 
Figure 2: Anatomy of an attribute cache.

Cache Categories: The different cache strategies, or subcaches, used by the attribute cache. The file 
attribute determines the category for an individual request. 

Attribute Class: The set of files that map to a particular cache category. 

Figure 2 shows the attributes, cache categories and attribute classes used by one attribute cache. This 
attribute cache has three separate subcaches, one for each category of expected cache behavior. The three 
cache categories are ID, temporal, and sequential. The file attributes are the four file types inode, directory, 
datafile, and executable, and an assigned attribute large derived from the file size and the file cut-off value. 
The attribute classes map attributes to cache categories. For example, files with inode or directory as their 
attribute belong to the ID class, and get assigned to the ID subcache. 
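
The mapping from attributes to cache categories described for Figure 2 can be sketched in a few lines of Python. This is only an illustration, not the implementation evaluated in the paper: the function name cache_category, the string labels, and the choice of a "greater than or equal" comparison are assumptions, and the default cut-off is the 512-Kbyte value used for the results in Section 3.2.

```python
# Illustrative sketch only: names, labels, and the >= comparison are assumptions.
ID_TYPES = {"inode", "directory"}
DATA_TYPES = {"datafile", "executable"}


def cache_category(file_type: str, file_size: int, cutoff: int = 512 * 1024) -> str:
    """Map a file's attributes to one of the three cache categories."""
    if file_type in ID_TYPES:
        return "ID"
    if file_type not in DATA_TYPES:
        raise ValueError(f"unknown file type: {file_type}")
    # The assigned attribute "large" is derived from the file size and the cut-off.
    is_large = file_size >= cutoff
    return "sequential" if is_large else "temporal"
```

Under these assumptions a 10-Kbyte datafile falls in the temporal class, while a 2-Mbyte executable falls in the sequential class.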

3 Attribute I/O Caches 

This section describes the design trade-offs and performance for possible attribute cache schemes. These 
attribute cache schemes use fixed cache partitions for different file attributes. Each partition has a block 
size designed to capture the locality of files with particular attributes. Because the type of locality a cache 
captures depends on cache size, the best partitioning varies with the size of the cache. 
Figure 3: Unix style cache read request miss ratio.

Allocating resources to cache partitions involves matching the existing workload locality with the proper 
cache resources necessary to capture the locality. An efficient allocation of resources gives cache space 
to the component or components that can capture the most locality in that space. A scheme wastes cache 
resources if it allocates space to components that cannot effectively use the space to reduce misses. Since 
the working set size varies for the different components, the cache partitioning can favor components whose 
working set can be captured at a given size. 

3.1 Baseline for Comparison 

A Unix style cache will be the baseline comparison for attribute caches. The Unix baseline allocates 32 
Kbytes for inodes, and divides the rest of the cache into 4-Kbyte blocks. It is fully associative, and has LRU 
replacement. The cache allocates writes. Figure 3 shows the baseline read request miss ratio for all the 
workloads. All subcache partitions are fully associative, allocate on write, and perform LRU replacement.
For write data security the caches are assumed to be non-volatile.
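
The policy shared by the baseline and the subcaches (fully associative, allocate on write, LRU replacement) can be sketched as a small trace-driven model. The class below is a hypothetical illustration rather than the simulator used for the results. It models a single partition, assumes every request falls within one block, and omits the separate 32-Kbyte inode area of the Unix baseline.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Fully associative, write-allocate block cache with LRU replacement.

    Simplifying assumption: each request touches exactly one block.
    """

    def __init__(self, size_bytes: int, block_bytes: int):
        self.capacity = size_bytes // block_bytes
        if self.capacity < 1:
            raise ValueError("cache must hold at least one block")
        self.block_bytes = block_bytes
        self.blocks = OrderedDict()  # block key -> None, least recently used first
        self.read_requests = 0
        self.read_misses = 0

    def access(self, file_id, offset: int, is_write: bool = False) -> bool:
        """Touch the block holding `offset`; return True on a hit."""
        key = (file_id, offset // self.block_bytes)
        hit = key in self.blocks
        if hit:
            self.blocks.move_to_end(key)          # mark as most recently used
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)   # evict the least recently used block
            self.blocks[key] = None               # reads and writes both allocate
        if not is_write:
            self.read_requests += 1
            self.read_misses += not hit
        return hit

    def read_miss_ratio(self) -> float:
        """Read request miss ratio over the requests seen so far."""
        if self.read_requests == 0:
            return 0.0
        return self.read_misses / self.read_requests
```

Replaying a trace through access() and reading read_miss_ratio() gives the kind of read request miss ratio plotted in Figure 3, within the limits of the simplifications above.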

There are three major cache size regions: The small I/O cache region, the working set capture region, 
and the large I/O cache region. Caches in the small region cannot capture the expected working set of 
the entire workload. Caches in the large cache region are big enough to capture the working set of most 
workloads. The working set capture region covers the intermediate sizes. 

Each of the three cache size regions needs to capture a different sort of locality: (1) Small caches have
very limited space and should be designed to capture locality that requires little space. A small cache can
potentially capture inode and directory requests, some small amount of datafile and executable temporal
locality, and some sequential behavior. (2) Caches in the working set capture region are large enough that
they should easily capture the inode and directory working sets. The cache should be tailored toward holding
the datafile and executable temporal working set, and then providing adequate support for capturing some
sequential locality. (3) Large caches have sufficient space to capture the temporal locality of inodes and
directories and of the datafiles and executables. The cache needs to capture the remaining sequential locality
and reuse of large sequential files, which have a reasonably long time period between reuse.

Figure 4: Logical configurations for attribute I/O caches.

Each region has unique constraints and workload behavior, and will be evaluated individually. The 
boundaries between the three regions are not rigid, but for simplicity the regions are defined as follows: 

Small I/O Cache: 128 Kbytes or less. 
Medium I/O Cache: 256 Kbytes to 4 Mbytes. 
Large I/O Cache: 8 Mbytes or more. 

3.2 Attribute Cache Design Parameters 

Figure 4 shows the cache configurations used to demonstrate the usefulness of attributes. Four separate 
subcaches are used to construct two-category and three-category attribute caches. 

The inode and directory subcache, also called the ID subcache, stores inodes and directories in small 
128-byte blocks, packing many objects in a small space. The general I/O subcache is designed to capture 
both the temporal and sequential locality of non-ID requests. The temporal subcache caches the bulk of the 
datafile and executable references; its size determines whether the cache can capture the whole workload 
working set. The sequential cache captures large sequentially accessed objects or large objects with very 
low expected reuse. All the results shown use a 512-Kbyte file cut-off to split file references between the
temporal and sequential subcaches. 

Describing attribute cache organizations concisely requires a notation convention. The important
characteristics of each subcache are the cache size, the block size, and the cache category. Each subcache 
will be described as follows: 

Cache_size/Block_size Cache Category 

A string of subcache definitions describes a complete attribute cache configuration. For example, a 
two-category attribute cache with a 64K inode and directory cache and a general cache with 8-Kbyte blocks 
is described as follows: 

64K/128 ID, X/8K General 

The ID cache is fixed at 64 Kbytes and the General cache size determines the total cache size. A 128
Kbyte I/O cache would have a 64 Kbyte general cache, whereas a 1 Mbyte cache would have a 960 Kbyte
(1 Mbyte less 64 Kbytes) general cache. From the cache size and block size it is easy to determine the number of
cache blocks. The 64K/128 ID cache has 512 blocks.
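
The notation lends itself to a small parser, sketched below. The helper names parse_size and parse_config are hypothetical, and the sketch assumes that at most one subcache uses the variable size X and that K and M denote powers of two.

```python
# Illustrative sketch; helper names are hypothetical.
UNITS = {"K": 1024, "M": 1024 * 1024}


def parse_size(text: str) -> int:
    """Convert '64K', '2M', or '128' to a byte count."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def parse_config(config: str, total_size: int) -> list[dict]:
    """Parse 'size/block category, ...'; a size of 'X' takes the remaining space."""
    subcaches = []
    for part in config.split(","):
        sizes, category = part.split(maxsplit=1)
        size_text, block_text = sizes.split("/")
        size = None if size_text.upper() == "X" else parse_size(size_text)
        subcaches.append({"category": category.strip(),
                          "size": size,
                          "block": parse_size(block_text)})
    fixed = sum(s["size"] for s in subcaches if s["size"] is not None)
    for s in subcaches:
        if s["size"] is None:
            s["size"] = total_size - fixed      # variable subcache gets the rest
        s["blocks"] = s["size"] // s["block"]   # number of cache blocks
    return subcaches
```

For the example configuration and a 1 Mbyte total, parse_config("64K/128 ID, X/8K General", 1024 * 1024) yields an ID subcache of 512 blocks and a 960 Kbyte general subcache of 120 blocks.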

3.3 Small I/O Cache Region 

The amount of existing locality any cache captures depends on the cache configuration, especially among 
small caches. Small I/O caches cannot capture the working set of most workloads, so configurations that 
use cache area more efficiently capture a greater fraction of the workload behavior. 

Several techniques increase the cache utilization, including (1) only storing frequently reused data in the 
cache; (2) increasing the number of objects stored in the cache by excluding larger objects; and (3) matching 
the cache size to the cache configuration that best captures locality. Directories are the most reused file 
type, followed by inodes. Even a small cache can hold many inodes and directories. A small cache can also 
potentially capture the small working sets typical of inodes and directories. 

Datafiles and executables exhibit some locality that a small cache can easily capture. The datafile and 
executable request miss ratio of Figure 15 shows significant locality capture with only four cache blocks. 
Beyond four blocks, reductions in request misses diminish until the cache has captured the entire working 
set. Many workloads have spatial locality that a cache using large blocks can capture. In fact, a single large 
block suffices to capture this locality. 

Figure 5 compares several two-category attribute cache configurations against the baseline. The
configurations are designed to capture locality in the small cache region; the region over which the configurations
are compared is between 4 Kbytes and 256 Kbytes. As the cache size increases, the cache options increase.
Allocating the entire cache to inodes and directories can be advantageous if the total cache is very small
(four to sixteen kilobytes, depending on the workload). Allocating resources to both the ID subcache and the
general I/O subcache captures the highly reused inodes and directories, as well as some temporal locality for
the datafiles and executables. Increasing the block size in the general subcache from 4 Kbytes to 16 Kbytes
captures some of the sequential locality for the datafiles and executables. As the cache size increases, so
should the inode and directory partition, and the block size of the general subcache.

Figure 5: Small cache options.

3.4 Medium I/O Cache Region 

The medium I/O cache region coincides with the range of sizes that capture the working set. Capturing 
the working set significantly reduces request misses. All medium I/O caches are large enough to capture 
the inode and directory working set. Medium cache designs need to concentrate on reducing the datafile 
and executable request misses without increasing the cache size required to capture the working set.

Datafiles and executables have both temporal and sequential locality. File size provides a simple 
mechanism for separating the temporal and spatial locality of executables and datafiles. This makes it 
feasible to tailor the cache management to the expected locality of each request, rather than to the average
locality of the entire workload. Sequential data can be cached in large blocks, while small highly reused 
files can be cached in small blocks. 

Managing temporal and sequential locality separately provides several potential advantages. Split
management can directly increase locality capture and reduce cache pollution. The temporal cache uses
small blocks to reduce wasted space and capture its working set in a minimal area. The sequential cache uses
a few large blocks to capture the majority of the sequential behavior in a small area. Limiting the amount
of space sequential data can occupy in the cache reduces cache pollution. A separate sequential subcache
holding only large files can also prevent large files with little sequential locality, such as executables, from
polluting the other subcache.

Figure 6: Medium cache options.

By separating the large datafiles and executables into the sequential attribute class, and the remaining 
datafiles and executables into the temporal attribute class, the attribute cache can capture both the temporal 
and sequential behavior in a smaller cache. 

In the medium size region, the goal is to significantly reduce the number of sequential accesses to large 
files, while not significantly increasing the required working set size for the remaining files. This requires 
limiting the sequential cache space until the cache has captured the working set of the workload. Since 
there is no way to determine if a running workload has captured its working set, a conservative approach 
is necessary. For the workloads studied, the working set size ranges from 512K to 16M. Only Ingres, the 
synthetic workload, required 16M. Of the remaining workloads, none required more than 2M. 

Figure 6 compares the read request miss ratio for several three-category attribute caches against the
baseline cache. The configurations are designed to capture locality in the medium cache region; the region
over which the configurations are compared is 256 Kbytes to 4 Mbytes. The figure shows the full range
of behavior for each cache configuration. The plot resolution is such that the smaller caches for each
configuration appear to be the same size. For example, the

512K/128 ID, X/4K temp, 2M/64K seq

cache configuration has a 512 Kbyte ID subcache, a 2 Mbyte sequential subcache, and a temporal subcache
ranging from 4 Kbytes to 256 Mbytes. On the log scale, the caches with temporal subcaches from 4 to
128 Kbytes all appear as 2.5 Mbyte caches even though they vary in size from 2564 to 2688 Kbytes.
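
The total sizes quoted above follow directly from adding the three subcache sizes. The short check below works through the arithmetic. The power-of-two steps for the temporal subcache are an assumption made for the illustration, since the text states only the end points of the range.

```python
def total_cache_kbytes(id_kb: int, temporal_kb: int, sequential_kb: int) -> int:
    """Total size of a three-category attribute cache, in Kbytes."""
    return id_kb + temporal_kb + sequential_kb


# 512 Kbyte ID subcache, 2 Mbyte (2048 Kbyte) sequential subcache,
# temporal subcache stepped in assumed powers of two from 4 to 128 Kbytes.
sizes = [total_cache_kbytes(512, t, 2048) for t in (4, 8, 16, 32, 64, 128)]
# sizes == [2564, 2568, 2576, 2592, 2624, 2688]

# The whole range spans less than 5% of the smallest size, which is why the
# points are indistinguishable on a logarithmic axis.
spread = (sizes[-1] - sizes[0]) / sizes[0]
```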

The sequential subcache uses 64-Kbyte blocks to significantly reduce sequential misses. Each workload 
requires a certain size temporal cache to capture the working set. Larger sequential or ID subcaches capture 
more locality, but also shift the total cache size required to capture the working sets. 

The MECCA Server workload has one of the largest working sets. The sequential subcache causes an 
increase in the cache size required to capture the working set, and produces no reduction in the miss ratio. 
The ID subcache also increases the cache size for working set capture, but a larger ID subcache reduces the 
cache miss ratio. Most of the workloads, however, use the sequential subcache to some extent and show 
lower read request miss ratios in this middle range. 

3.5 Large I/O Cache Region 

In the large cache size region, the prime objective is to capture the entire workload with as few request 
misses as possible. There is adequate cache to capture the working set, and to capture reuse of the large files. 
Neglecting the large file reuse does not always significantly impact the total requests (a 1-Mbyte file requires
only eight 128-Kbyte block requests), but it does affect the total bytes transferred. Since large files constitute
the majority of bytes transferred, capturing their reuse is critical to keeping the disk data transfer time low. 

Figure 7 compares three-category attribute caches having large ID and sequential subcaches against the 
baseline cache, and against a two-category cache having 16-Kbyte blocks in its general I/O subcache. The 
figure shows both 64- and 128-Kbyte block sizes for the sequential subcache. The additional benefit from 
doubling the sequential block size depends on the amount of sequential behavior in the workload. If the 
workload has a large sequential component, the larger blocks reduce the miss ratio. With a multi-megabyte 
sequential cache, the large number of blocks suffices to eliminate any conflicts. At such a size, increasing 
the block size to 128K does not increase the read request miss ratio for any workload. For most of the 
workloads, increasing the sequential subcache size beyond two megabytes does little to reduce the read 
misses, but it significantly reduces the number of bytes transferred to and from disk. A 16-Mbyte sequential 
subcache captures the large file reuse for all of the workloads. 

Increasing the inode cache beyond 128K reduces the misses on several workloads. In general, little
gain accrues from increasing the ID cache beyond 128K, unless it goes to at least 1M. A couple of workloads
see a significant reduction when the ID subcache increases to 1M, and a few more when it goes to 4M.
This results from applications that read all the meta-data on a disk, suggesting that an alternative mechanism
should really be used to support these applications.
Figure 7: Large cache options.

Figure 8: Variable attribute cache configurations for each size region.



4 Variable Attribute Cache Scheme 



The variable attribute cache scheme uses attributes to substantially reduce read request misses. The scheme 
was designed based on an evaluation of the overall workloads. It is neither an optimal solution given the 
cache partition requirement, nor an optimal solution given the set of experiments simulated. Many cache 
choices depend on the expected workload. The goal was to pick a simple scheme that works well over a 
broad range of workloads and for many potential disk or network systems. 

No single cache configuration produces low miss ratios over a broad range of caches, hence the scheme 
varies with cache size. The design exploits common workload behavior and systematically varies the 
attribute cache configuration along with its cache size to capture the appropriate behavior. The resulting 
design is not optimal, but it shows the type of benefits that could be expected from a real attribute cache 
scheme. 

Figure 8 shows the attribute cache configurations and the general policy governing subcache space
allocation for each of three cache size regions. In the small cache region, the scheme uses a two-category
attribute cache, allocating half the cache to inodes and directories, and partitioning the other half into a
general subcache having four equal blocks. The general subcache is designed to capture both temporal
and sequential locality. In the medium cache size region, the scheme allocates the bulk of the area to the
temporal subcache, which uses 4-Kbyte blocks to capture the temporal working set. It allocates from 64 to
128 Kbytes to each of the ID and sequential subcaches to capture ID requests and sequential requests. In
the large cache region, the scheme increases the sequential subcache so that it occupies a large part of the
cache, starting at about one-third and increasing to one-half at the high end of the cache region. The ID
subcache remains fixed at 128 Kbytes until the cache becomes large enough to support a multi-megabyte
ID subcache, at which point it expands to 25% of the cache area.

Table 1: Variable Attribute Cache Scheme.

Table 1 describes the exact cache configurations used in each region by the variable attribute cache 
scheme. 
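
The general allocation policy can be sketched as a function of total cache size. The sketch below follows only the prose description of the three regions and does not reproduce the exact values of Table 1. The function name variable_config and both keyword parameters are hypothetical placeholders: medium_side stands for the 64 to 128 Kbytes given to each of the ID and sequential subcaches, and big_id_threshold is an assumed switch-over point for the larger ID and sequential shares, which the text does not state. Sizes that fall between the listed region boundaries are treated as belonging to the next larger region.

```python
KB, MB = 1024, 1024 * 1024


def variable_config(total: int,
                    medium_side: int = 64 * KB,        # placeholder within the stated 64-128 Kbytes
                    big_id_threshold: int = 32 * MB    # placeholder, not a value from the paper
                    ) -> dict:
    """Return approximate subcache sizes in bytes for a total cache size."""
    if total <= 128 * KB:
        # Small region: half to inodes and directories, the other half to a
        # general subcache organized as four equal blocks.
        id_size = total // 2
        general = total - id_size
        return {"ID": id_size, "general": general, "general_block": general // 4}
    if total <= 4 * MB:
        # Medium region: small fixed ID and sequential subcaches, with the bulk
        # of the space in a temporal subcache of 4-Kbyte blocks.
        return {"ID": medium_side, "sequential": medium_side,
                "temporal": total - 2 * medium_side, "temporal_block": 4 * KB}
    # Large region: the sequential share grows from about one-third to one-half,
    # and the ID subcache grows from 128 Kbytes to 25% of the cache.
    # Tying both changes to a single threshold is an assumption.
    big = total >= big_id_threshold
    id_size = total // 4 if big else 128 * KB
    seq_size = total // 2 if big else total // 3
    return {"ID": id_size, "sequential": seq_size,
            "temporal": total - id_size - seq_size}
```

For instance, under these assumptions a 64 Kbyte cache splits into a 32 Kbyte ID subcache and a general subcache of four 8 Kbyte blocks.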

4.1 Read Request Behavior 

Figure 9 compares the variable attribute cache scheme with the Unix baseline scheme for the four
representative workloads. The variable scheme lowers the read request miss ratio (RRMR) across the full
cache range. The resulting miss ratio usually corresponds with that of caches eight times the size. For many
medium and large caches, however, the variable scheme produces RRMRs below that of the maximum Unix baseline cache.

The MECCA Server workload exhibits anomalous variable-scheme cache behavior. As previously 
noted, the workload locality is only captured by large cache blocks. In the small cache region, the variable 
scheme uses larger blocks which are ideal for this workload. In the medium cache region, the variable 
scheme changes to 4K blocks in the temporal cache and partitions space for a sequential cache. The 4-Kbyte 
blocks produce comparable miss ratios for each of the two schemes, but the MECCA Server workload
benefits little from the sequential subcache. This subcache sits unused, increasing the total space required
to capture the working set. 

Figure 10 compares changes in the read requests for the four sample workloads, relative to the Unix 
baseline cache. The workloads see dramatic reductions in their read request misses for both small and very 
large caches, and moderate reductions over a broad range of middle cache sizes. These reductions result in 
fewer disk read accesses and fewer times when applications must wait for I/O requests to complete. 
Figure 9: Attribute cache scheme request miss ratios.

Figure 10: Relative read misses for the four sample workloads.



Small Cache Behavior 

In the small cache region, the variable scheme reduces read request misses by capturing more inode and 
directory temporal locality and more datafile and executable sequential locality. The ID subcache eliminates 
unused space within a block for directories. The same cache size thus holds many more directory entries. 
The ID cache also protects inode and directory data from being evicted by datafiles and executables. In 
the Unix I/O subcache, directories compete with datafiles and executables. In the small cache region, this 
competition prevents the directories from capturing their working set. Since the combination of inodes and 
directories produces roughly 75% of the total requests, increased capture produces large reductions in the 
total disk requests generated. 



Medium Cache Behavior 

For medium cache sizes, the variable attribute cache scheme performance depends highly on the exact 
nature of the workload. Figure 10 shows, for certain cache sizes, that the variable scheme reduces misses 
by 75% in the Application Data Analysis workload, but produces 25% more misses for the MECCA Server 
workload. In this region, both cache schemes capture similar amounts of inode and directory requests. The 
Unix subcache is large enough to capture most of the directory working set even with the competition from 
datafiles and executables. Directories are reused frequently, so only large runs of datafiles or executables 
throw out the working set. The variable scheme protects the directory working set from being periodically 
thrown out of the cache, but the benefits here are much less than in the small cache region.

The sequential subcache acts as a liability or an asset depending on workload behavior. Workloads 
with few large file requests reap no benefit from the sequential subcache, and it goes unused. The unused 
sequential subcache merely increases the total cache size required to capture the workload working set. 
Workloads that do have large sequentially-accessed files show large read miss reductions. The sequential 
subcache directly reduces the sequential read request misses. With typical 8-Kbyte read requests, the 64-Kbyte
cache blocks can reduce sequential misses by 88%. Segregating large files to the sequential subcache
reduces cache pollution in the bulk of the cache. The temporal subcache protects its working set from large 
files, which lowers the disk accesses needed to service the temporal read requests. 
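
The 88% figure is consistent with simple arithmetic on the block and request sizes. The sketch below assumes a purely sequential run of fixed-size requests and compares against a cache that misses on every request; the function name is hypothetical. With 64-Kbyte blocks, one request in eight misses, a reduction of 87.5%. The same calculation for the 128-Kbyte blocks used in the large cache region gives 93.75%.

```python
def sequential_miss_reduction(request_bytes: int, block_bytes: int) -> float:
    """Fraction of sequential read misses removed by fetching whole blocks.

    Assumes a purely sequential run of fixed-size requests, compared with a
    cache that misses on every request.
    """
    if block_bytes < request_bytes:
        return 0.0  # blocks smaller than a request give no benefit in this model
    requests_per_block = block_bytes // request_bytes
    # Only the first request to each block misses.
    return 1.0 - 1.0 / requests_per_block


KB = 1024
print(f"{sequential_miss_reduction(8 * KB, 64 * KB):.1%}")   # 87.5%
print(f"{sequential_miss_reduction(8 * KB, 128 * KB):.2%}")  # 93.75%
```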

Large Cache Behavior 

Performance trade-offs for large caches smaller than 24 Mbytes resemble those in the middle region, except
that both the Unix baseline and the variable scheme capture the temporal working set. The sequential
subcache captures locality without occupying space that might otherwise allow capture of the temporal
working set. Increased sequential cache size begins to capture reuse for some large files.

Beyond 24 Mbytes, the variable attribute cache scheme significantly outperforms the Unix baseline. 
Both schemes have read request miss ratios below 5%. Most of the misses come from inodes and directories, 
or from datafile and executable cold misses. Some workloads touch many inodes and directories. Small 
ID subcaches still capture 90% of the requests, but at low total cache miss ratios the uncaptured inodes 
dominate the total misses. The large ID subcache captures reuse for these inodes and directories. Large 
files generate an excessive number of cold misses, caused by low reuse rates and small requests generating 
many misses. The sequential subcache fetches large blocks, reducing the number of cold misses generated 
by large files. 128-Kbyte blocks reduce large file cold misses by more than 90%. 

4.2 Write Expulsion Behavior 

Main memory caches capture only a limited amount of write data locality. If the cache stores the only 
data copy, the data is vulnerable to loss. Writing all new data directly to disk protects the data but results 
in high write disk traffic. Non-volatile caches provide reliable data storage, which allows newly written 
data to reside in the cache for extended periods of time. Non-volatile caches allow write locality to be 
captured. Reddy [10] characterizes the behavior of write data in non-volatile caches. Writes have lower 
miss ratios than reads, signifying better spatial locality. Using non-volatile caches, the read to write 
ratio of disk requests remains about constant across cache size, whereas with volatile caches, the write 
component becomes increasingly dominant with larger cache sizes. Two non-volatile cache organizations 
were evaluated by Baker et al. showing the feasibility of adding non-volatile RAM beside a volatile 
cache [2]. One megabyte of non-volatile cache reduces the number of bytes written to disk by 40-50%. 
Non-volatile caches are becoming more common, and will become more so as the price of non-volatile
RAM drops.

In a non-volatile cache, writes only generate disk requests when the cache evicts the data. With an early 
eviction policy, or with sufficient buffering, the operating system schedules disk writes when the disk is 
otherwise idle. Writes impact I/O performance indirectly, through resource contention. Excessive writes 
will increase the read request service time, since reads will wait more for write accesses to complete. I/O 
caches need to have a stable write performance. 

Figure 11 shows the write expulsion characteristics for the MECCA Server and the CAD Chip Build 
workloads. Each block of write (or dirty) data evicted from the cache produces an independent disk write 
request. The write expulsion ratio measures the number of disk accesses relative to the number of write 
requests in the workload. The write cache behavior differs in several ways from the read behavior. (1) In the 
smallest cache, the variable scheme produces as many write expulsions as the Unix baseline scheme does. 
Inodes and directories compete for space in the ID subcache. Since most inodes are updated with file access 
time information, they generate writes when they are evicted. Once the variable scheme's ID subcache 
reaches a size twice as big as the Unix scheme's inode subcache, the write expulsions drop considerably. 
(2) Sequential subcaches typically increase the cache size needed to capture the write working set. At the 
working set capture size, the baseline scheme often requires fewer disk writes. (3) Large caches have few 
if any evictions, so the write expulsions approach zero. The Unix baseline has a fixed inode subcache that 
always generates inode writes. 
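
The write expulsion ratio defined above can be illustrated with a minimal sketch, assuming a fully associative write-back LRU cache that holds a fixed number of blocks. The function name `write_expulsion_ratio` and the `(op, block)` trace format are hypothetical choices for illustration, not the simulator used for the measurements.

```python
from collections import OrderedDict


def write_expulsion_ratio(trace, capacity_blocks):
    """Return dirty-block evictions divided by write requests.

    trace: iterable of (op, block) pairs, where op is "r" or "w"
           (an assumed trace format).
    capacity_blocks: number of blocks in a fully associative LRU cache.
    """
    cache = OrderedDict()  # block -> dirty flag, least recently used first
    writes = 0
    expulsions = 0
    for op, block in trace:
        # Remove the block (if present) so it can be re-added as most recent.
        dirty = cache.pop(block, False)
        if op == "w":
            writes += 1
            dirty = True
        cache[block] = dirty
        if len(cache) > capacity_blocks:
            _, was_dirty = cache.popitem(last=False)  # evict the LRU block
            if was_dirty:
                expulsions += 1  # each dirty eviction is one disk write
    return expulsions / writes if writes else 0.0
```

Dirty blocks still resident at the end of the trace are not counted, which matches the observation that large caches with few evictions approach zero write expulsions.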

Figure 11 also compares the expulsion writes of the four sample workloads to the Unix baseline. Except 
for small caches and narrow cache regions corresponding to the workload working set size, the variable 
scheme produces fewer write expulsions. Increasing the cache size increases the number of inodes in 
the cache, thereby reducing the write expulsions to zero. The variable attribute cache scheme produces 
competitive write expulsion behavior. 

4.3 Variable Attribute Cache Compared with Other Schemes 

The variable attribute cache scheme outperforms other schemes by exploiting the distinct cache behavior 
of workload components, and by varying the cache scheme to best use the cache area for locality capture. 
Figure 12 shows a variety of fixed cache schemes in relationship to the variable attribute cache scheme. 
Comparing the read request miss performance shows the strengths and weaknesses of each cache scheme, 
as well as the cache range over which they perform the best. 

Unix baseline (32K/128 inode, x/4K Unix) 
Unix style (32K/128 inode, x/16K Unix) 
Unix style (32K/128 inode, x/32K Unix) 

For small caches, these three schemes cannot capture sufficient directory locality. Increasing the block 
size in the non-inode part of the Unix style I/O cache only exacerbates the situation. For medium and 
large caches, increasing the block size reduces the request misses, once they have captured the working set. 
The larger block size increases the cache size required to capture the working set. For many workloads, 
inadequate inode cache size severely limits the cache performance. These configurations do, however, 
compete favorably for a couple of workloads. The MECCA Server and Network Update workloads have 
highly spatial requests to many medium-sized files, and relatively few ID requests. The larger block size 
captures this locality. Once the entire working set fits in the cache, other requests dominate. 

Figure 11: Write behavior. 
Figure 12: Fixed schemes compared with the variable attribute cache scheme. 

Two-category (64K/128 ID, x/8K general) 
Two-category (64K/128 ID, x/16K general) 

Support for inode and directory caches reduces the miss ratio considerably for small caches, especially 
when compared with Unix style caches having identical block sizes. Larger block sizes reduce the request 
misses in workloads with much sequential locality. For workloads with more temporal locality, the larger 
blocks improve performance in the middle cache range, but not for larger caches. The 64K ID subcache 
shows performance benefits even for large caches. 

Three-category (64K/128 ID, x/4K temporal, 64K/64K sequential) 

The three-category attribute cache performs poorly in small caches: it dedicates most of its space to the 
sequential and the ID subcaches, so it cannot capture any temporal locality beyond that of the inodes and directories. 
sequential and the ID subcaches. In the mid-range, it performs very well on workloads having large files 
with both a sequential and a temporal component. 

For large caches, the scheme fails for several reasons. (1) The sequential cache is too small to prevent 
conflicts among large files, or to capture any reuse. (2) The temporal cache cannot capture any sequential 
behavior. (3) The ID subcache performs the same as the two-category scheme. 

Variable attribute cache scheme 

The variable attribute scheme significantly reduces read misses for both small and large caches by 
capturing ID locality and large-file sequential behavior. In the mid-size region, it reduces misses best when 
the workload has significant temporal locality and large-file sequential locality. Here, the temporal working 
set is protected from sequential sweep behavior, and the sequential cache explicitly captures sequential 
locality. The scheme fails to capture sequential locality for medium-sized files. These can add significantly 
to the request misses if the cache size is smaller than the working set capture size. A larger block size for 
the temporal cache might further reduce the read misses for mid-sized caches. 

4.4 Total Performance Results 

Each read request miss and write expulsion generates a disk access. Figure 13 shows the number of 
disk accesses for the variable attribute cache scheme relative to the Unix baseline. The figure breaks read 
and write disk accesses down separately, because they impact system performance in different ways. 

Figure 13: Read and write disk request performance for all workloads. 

Disk accesses from read request misses determine how many times applications wait for I/O to complete, 
and the minimum number of context switches required to overlap computation with the I/O. The variable 
attribute cache scheme reduces the number of read disk accesses for almost all workloads over a full range 
of I/O cache sizes. Averaging over the workloads, it reduces the read accesses by at least 18% and as much 
as 66% depending on the cache size. The overall reduction averaged 48% in the small cache region, 28% 
in the middle cache region, and 58% in the large cache region. 

Disk accesses from write expulsions increase the disk utilization and the probability that the disk will be 
busy when a read request miss occurs. Over most cache sizes, the variable attribute cache scheme does little 
to reduce the write expulsions. With the variable scheme, some workloads generate more writes than the 
baseline scheme, and others generate fewer writes. The write working set size is somewhat larger than the 
read working set. The variable attribute cache scheme frequently requires additional cache space to capture 
this working set. The spikes in the write expulsion graph correspond to these differences in working set 
capture. 

Figure 14 sums the read and write disk accesses, showing the total disk accesses. For small caches, the 
writes increase the total relative disk requests, but for caches above 1 Mbyte they reduce the total relative 
requests. Total disk accesses fall by an average of 38% in the small cache region, 31% in the middle 
cache region, and 66% in the large cache region. 

Figure 14: Total relative disk requests. 

5 Conclusions 

Attribute caches capitalize on the distinct access patterns of different file types and sizes. Each subcache 
uses a different block size to capture the locality of its attribute class. Allocating small files to small 
blocks increases the number of independent files stored in the cache. Allocating medium files to mid-sized 
blocks captures temporal locality with a smaller cache. And allocating large files to large blocks captures 
sequential locality. Matching the block size to the attribute class reduces unused cache space, increases 
cache utilization, and reduces the number of request misses. 
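
As a rough illustration of how attributes steer a request to a subcache, the sketch below routes requests by file type and file size. It assumes the block sizes of the three-category configuration from Section 4.3 and the 512-Kbyte large-file boundary used in Appendix B.3; the names `Subcache` and `select_subcache` are hypothetical, and the sketch does not model how the variable scheme changes its partitions with cache size.

```python
from dataclasses import dataclass

KB = 1024


@dataclass(frozen=True)
class Subcache:
    name: str
    block_size: int  # bytes


# Block sizes taken from the three-category configuration in Section 4.3.
ID_SUBCACHE = Subcache("ID", 128)
TEMPORAL_SUBCACHE = Subcache("temporal", 4 * KB)
SEQUENTIAL_SUBCACHE = Subcache("sequential", 64 * KB)

# Assumed boundary: Appendix B.3 treats files of at least 512 Kbytes as large.
LARGE_FILE_BYTES = 512 * KB


def select_subcache(file_type, file_size):
    """Pick a subcache from the request's attributes (file type and size)."""
    if file_type in ("inode", "directory"):
        return ID_SUBCACHE
    if file_size >= LARGE_FILE_BYTES:
        return SEQUENTIAL_SUBCACHE
    return TEMPORAL_SUBCACHE
```

For example, `select_subcache("datafile", 2 * 1024 * KB)` selects the sequential subcache, so a large file cannot displace inodes or small datafiles held in the other two subcaches.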

Requests from large files do not fill up the entire cache and force out small files. This allows the cache 
to capture the working set of individual components even if the workload working set is larger than the 
cache. Capturing the working set of individual components significantly reduces the total request misses. 

The subcache partitions must change with cache size to capture the greatest locality. By defining a 
different cache partition for each cache size, the variable attribute cache scheme performs well over 
the full cache size range. 

When compared with a Unix style cache, the variable attribute cache reduces the read disk requests by at 
least 18% and as much as 66% depending on cache size. Writes to disk decrease as the individual subcache 
partition sizes increase. Large attribute caches have very few writes to disk. The reduction in read accesses 
reduces the total time required to service disk requests with any size cache, whereas reductions in disk 
writes mainly reduce the disk service time when the cache size is large. 

A Trace Collection and Simulation 

An I/O workload trace should contain block access patterns as well as file and application information. File 
access patterns suffice to model broad I/O cache performance, but cannot provide the link between workload 
activities and cache behavior. Additional information is required to understand the nature of application I/O 
requests, and improve I/O cache performance in non-ad-hoc ways. 

Relating cache performance to application behavior requires file system information along with appli- 
cation I/O requests. Understanding the nature of application I/O requests can drive I/O cache performance 
improvements or application I/O optimizations. 

A.1 Trace Collection 

A new version of the WRL tracing facilities collected traces on DECstation 5000s running ULTRIX [4, 5]. 
Its kernel-based approach traces all processes. The modified system logs system call information in a 
physically mapped trace buffer. On an I/O system call, the call type, process ID, and call parameters are 
entered in the buffer. On return from the system call, the return value, error status and call information 
are entered in the buffer. When the buffer becomes sufficiently full, the kernel schedules a special process 
called the analysis program to read and process the buffer contents. To generate an I/O system call trace, the 
analysis program matches call and return values, produces a file system event trace, compresses the trace 
and writes it to a file. 

The set of I/O system calls traced includes all file related activity - read, write, open, close, create, 
reposition, delete, move, and executable execution. The I/O system call traces cannot be directly used for 
I/O cache simulations since individual I/O requests refer to state information stored within the operating 
system. A post pass simulates the operating system file management and produces a stateless trace. 
Generating a stateless trace of file block read and write requests requires keeping track of the current working 
directories, and simulating the operating system file table information and file descriptors [1, 7]. 

Unique identification numbers are assigned to each file. The stateless trace includes the filename of 
each I/O request in the form of a unique ID. It also includes the file type, which is implied by the system call, 
and the explicit range of data bytes requested from the file. For executables the number of bytes accessed 
equals the file size. The stateless trace drives all I/O cache simulations. 
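
The post pass can be pictured with a minimal sketch, assuming a simplified event format. The tuple layout and the name `make_stateless` are hypothetical; the sketch tracks only per-process file descriptors and file positions, and leaves out working-directory resolution, descriptor inheritance, and executables.

```python
def make_stateless(events):
    """Convert a stateful system call trace into stateless requests.

    events: iterable of tuples in an assumed format:
        ("open", pid, fd, path)
        ("read" or "write", pid, fd, nbytes)
        ("seek", pid, fd, offset)
        ("close", pid, fd)
    Yields (op, file_id, offset, nbytes) records that carry no hidden state.
    """
    file_ids = {}    # path -> unique file ID
    open_files = {}  # (pid, fd) -> [file_id, current offset]
    for event in events:
        kind, pid, fd = event[0], event[1], event[2]
        if kind == "open":
            # Reuse the ID if the path was seen before, else assign the next one.
            file_id = file_ids.setdefault(event[3], len(file_ids))
            open_files[(pid, fd)] = [file_id, 0]
        elif kind == "seek":
            open_files[(pid, fd)][1] = event[3]
        elif kind in ("read", "write"):
            state = open_files[(pid, fd)]
            yield (kind, state[0], state[1], event[3])
            state[1] += event[3]  # the file position advances past the data
        elif kind == "close":
            del open_files[(pid, fd)]
```

Each output record names the file and the explicit byte range, so a cache simulator can consume the records in order without modeling the file table.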

A.2 File Types 

I/O requests resulting directly from program execution can be grouped into four categories: datafiles, exe- 
cutables, inodes, and directories. Datafiles are explicitly read or written by active processes. Executables 
are run by processes, and initiated via one of the exec system calls. Inodes and directories contain meta-data. 
Inodes contain information used by the operating system to locate the actual data on the disk. Directories 
facilitate the user organization of data and point to inodes. References to inodes and directories occur when 
opening files or evaluating access permissions. 

A.3 Workloads 

Figure 1 describes the eleven workloads evaluated in this paper. Most of the traces monitor several days of 
user activity. Such long traces are necessary to capture significant I/O activity and to show the interaction 
of the many large and small files that comprise a workload. 

The traces cover a wide range of user applications and types of work. All traces were collected in a 
research laboratory with a broad range of activities. Although not necessarily typical of heavy commercial 
use, such as large database systems, the traces should represent many engineering development and office 
environments. More detailed descriptions of the workloads are included in [12]. 

A.4 Workload Characteristics - Dynamic Features 

Most requests are to small files. Including the request contribution of inodes and directories in this measure 
skews the distribution further toward small files. Fewer than 1% of sequential file accesses exceed 16 Kbytes. 
Most of the bytes transferred to and from applications, however, reside in large files and are accessed as 
multiple sequential requests. In fact, half of the bytes transferred occur in sequential runs of greater than 
64 Kbytes and a quarter of all bytes transferred are in sequential runs of more than 256 Kbytes. The 
individual data request size is normally determined by the standard libraries; large sequential runs are 
composed of many small sequential data requests. Thus, even though large sequential accesses do not make 
up a significant fraction of the actual file requests, most of the bytes transferred by the workload occur 
in these large sequentially accessed files. This type of behavior has been measured in other workloads as 
well [9, 3]. 

Most of the datafile requests are for at most 8 Kbyte blocks regardless of the file size or run length. Large 
sequential runs thus generate many requests. Since many smaller requests produce large runs, requests to 
these runs have a considerable sequential locality that can be exploited with a cache policy. Transferring 
the runs in a few larger blocks can reduce I/O cache misses, disk overhead and the time the disk spends 
servicing requests. 
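
One way to measure how many small requests combine into sequential runs is sketched below. It assumes a request extends a run when it begins exactly where the previous request to the same file ended; this definition and the function names are illustrative assumptions, not necessarily the method used to produce the figures above.

```python
def sequential_run_lengths(requests):
    """Group requests into sequential runs and return each run's byte length.

    requests: iterable of (file_id, offset, nbytes) in trace order.
    A request extends a run when it starts where the previous request
    to the same file ended (an assumed definition of "sequential").
    """
    next_offset = {}  # file_id -> offset expected if the run continues
    current = {}      # file_id -> bytes accumulated in the open run
    runs = []
    for file_id, offset, nbytes in requests:
        if file_id in current and next_offset[file_id] != offset:
            runs.append(current.pop(file_id))  # the run was broken
        current[file_id] = current.get(file_id, 0) + nbytes
        next_offset[file_id] = offset + nbytes
    runs.extend(current.values())  # close runs still open at end of trace
    return runs


def fraction_of_bytes_in_long_runs(runs, threshold):
    """Fraction of all bytes that belong to runs longer than threshold."""
    total = sum(runs)
    return sum(r for r in runs if r > threshold) / total if total else 0.0
```

Applied to a stateless trace, `fraction_of_bytes_in_long_runs(runs, 64 * 1024)` would give the share of bytes transferred in runs longer than 64 Kbytes.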

However, applications use large sequential runs infrequently, so trying to keep them around can pollute 
the cache with data unlikely to be re-referenced, and can evict many smaller objects that will be re-referenced. 

A.5 I/O Cache Simulation 

A single cache simulator, applied and configured in many different ways, was used to study the workload 
behavior in I/O caches. The caches are assumed to be non-volatile. Writes are allocated to the cache and 
written back only when evicted, that is, when they become the least recently used entry. The 
I/O cache simulator models fully associative I/O caches using an LRU replacement policy. 

A request manager generates I/O cache block references and then uses the LRU stack hit depth or 
cache miss information to determine the appropriate number of request misses for each given cache size. 
A single simulation produces a full range of I/O cache block behavior, read request, write request and total 
request miss behavior. 
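
The reason one pass can cover every cache size is the LRU stack property: a fully associative LRU cache of C blocks hits exactly those references whose stack depth is at most C. The sketch below is a simple, unoptimized illustration of that idea with hypothetical function names; it works on block references only and does not reproduce the request manager's accounting of request misses.

```python
def lru_stack_depths(block_refs):
    """Return the LRU stack depth of each reference (None for a cold miss).

    Depth 1 means the block was the most recently used one.
    """
    stack = []  # most recently used block first
    depths = []
    for block in block_refs:
        if block in stack:
            depth = stack.index(block) + 1
            stack.remove(block)
        else:
            depth = None  # first reference to this block
        stack.insert(0, block)
        depths.append(depth)
    return depths


def misses_for_size(depths, capacity_blocks):
    """Count block misses for a cache of capacity_blocks blocks."""
    return sum(1 for d in depths if d is None or d > capacity_blocks)
```

The depths are computed once per trace, and `misses_for_size` can then be evaluated for any number of cache sizes without replaying the trace.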

Request Model 

An application requests file I/O. An individual I/O request may encompass several cache blocks, each of 
which may hit or miss in the cache. The number of cache block requests generated by an I/O request 
depends on the original request size, the cache block size, and the offset of the request in the file. Each file 
has a unique file ID. This ID, along with an offset into the file, forms the address for a data request. The 
file offset is converted to a cache block offset that is then used to access the I/O cache. A cache block holds 
data from only one file at a time. Caches that store actual disk blocks, rather than file blocks, could have 
pieces of several files in a single block. 
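
The mapping from one I/O request to cache block references can be sketched as follows. The function name `block_references` is hypothetical, and the sketch omits the splitting of requests that exceed the cache size.

```python
def block_references(file_id, offset, nbytes, block_size):
    """Return the (file_id, block_index) pairs touched by one I/O request."""
    if nbytes <= 0:
        return []
    first = offset // block_size
    last = (offset + nbytes - 1) // block_size  # block holding the last byte
    return [(file_id, index) for index in range(first, last + 1)]
```

An 8-Kbyte request at offset 0 touches two 4-Kbyte blocks but only one 64-Kbyte block, and a small request that straddles a block boundary still touches two blocks, which is why the block count depends on the offset as well as on the request and block sizes.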

If the request size exceeds the cache size, the request is modeled as multiple requests. This is necessary 
because all requests go through the I/O cache before being delivered to the application. A request size 
greater than the cache size incurs multiple request misses; this usually only occurs for small caches. 

Executable usage is modeled as a single request. In an entirely demand paged system these would 
be many small page sized requests for the executable program. Here executables are modeled as a single 
request for the entire executable so they generate large requests. 

Read request misses reflect the number and size of read disk accesses for a workload. Since writes occur 
only when data is evicted from the cache, each disk write is a cache block in size. More advanced write 
expulsion techniques exist [11], and their impact should be similar to other reported results. 

B Workload Component Behavior in I/O Caches 

This appendix examines the cache behavior properties of various components of the I/O workload. The 
behavior differs for each component. The locality might be sensitive to cache block size or total cache size, 
but is often sensitive to both. Different components require different block and cache sizes to effectively 
capture locality. 

If all I/O requests are cached without regard to file type, few options exist for reducing the cache misses 
and improving cache performance. Figure 15 shows the cache miss behavior of a typical workload in a 
unified cache that uses no information about request types. Due to the overwhelming number of small 
requests from inodes and directories, smaller block size choices always win. Improving the I/O cache 
performance requires more information about the statistical properties of the workload and the type of 
locality that can be captured. 

Figure 15: Typical workload behavior in a unified cache. 
Figure 16: Datafile and Executables: miss request ratios with both temporal and sequential locality. 

B.1 Datafiles and Executables 

Datafiles have a broad distribution of file sizes, reuse rates and access patterns. The average request 
patterns of a workload determine the best block size choice. The request pattern varies considerably among 
workloads. In general, most requests access data from small, highly-reused datafiles, while most of the 
data actually transferred comes from large, sequentially-accessed datafiles. The cache must capture both 
the temporal locality of small datafiles and the sequential locality of large datafiles. 

Caching executables with datafiles eliminates consistency problems, since many executables start out 
as datafiles generated by compilers, loaders or editors. Executables and datafiles also have similar size 
distributions, even though there are fewer small executable files than small datafiles. 

The locality in the individual workloads varies, ranging from primarily temporal locality, to 
primarily spatial locality, to both spatial and temporal. Many workloads exhibit both spatial and temporal 
locality; for these, increasing the block size reduces the request misses but increases the cache size required to capture 
the working set. This is the case for the workload shown in Figure 16. For cache sizes between the small 
and large block working set capture points, the large block size produces substantially poorer performance. 
The best block size choice depends on the I/O cache size, the workload working set size, and the type of 
workload locality. 

B.2 Inodes and Directories 

The cache behavior of inodes and directories is very similar; both are small and highly reused. Including 
both in the same cache produces uniform behavior across all workloads. Figure 17 shows the read hit ratio 
and the total hit ratio for inodes and directories together in a cache with 128-byte blocks. In this cache, a 
single 512-byte directory entry occupies four blocks. The highly temporal locality of inode and directory 
requests greatly outweighs their spatial locality. 

Because of the relationship between inodes and directories, workloads often require many inodes and 
directories at the same time, so the two compete for cache space. However, both working sets are small and 
256 entries suffice to capture the inode and directory working sets and eliminate competition. This requires 
only 32 Kbytes of cache. 

The combined read and write miss ratio seen in Figure 17 is about half the read miss ratio, predominantly 
because of inode writes. Most inode writes merely update file access times and do not modify the file system 
structure. Each of these updates essentially forms a read-modify-write pattern, such that the write part 
always hits in the cache. The size of the cache determines whether or not these writes get reused before 
being written back to the disk. 

Figure 17: Inode and Directory references in a fully associative I/O cache. 

B.3 Separating Spatial and Temporal Locality 

For the workloads that have large sequentially-accessed files, large blocks can dramatically reduce the 
number of misses these files generate. Segregating large files into a separate subcache controls their impact 
on the overall cache. Large files that are reused less will not push out the many smaller files that are reused 
more. This segregation works even when the large files are not sequentially accessed. 

Sequential Cache Properties 

Figure 18 shows the sequential cache request miss ratio (RMR) behavior of the large files, those at least 
512 Kbytes in size. The workloads exhibit two types of behavior. (1) Large files are accessed sequentially, 
allowing sequential locality capture. In this case, increasing the block size produces almost ideal 
reductions in the request miss ratio. Doubling the block size reduces the RMR by almost half. Doubling 
the block size from 4 Kbytes to 8 Kbytes, however, produces a much smaller reduction in the number of misses than 
subsequent doublings because many requests access 8 Kbytes regardless of whether the cache has 4-Kbyte 
or 8-Kbyte blocks. The Application workload illustrates this behavior. (2) The second type of cache behavior 
exhibited by some of the workloads shows no sequential locality. In this case, the RMR depends only on 
cache size, and not block size. The CPU Server workload falls into this class. 

Figure 18: Sequential cache miss request ratio for files larger than 512 Kbytes. 
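
The near-ideal halving of the RMR, and the weaker gain between 4-Kbyte and 8-Kbyte blocks, can be reproduced in an idealized setting. The sketch below assumes one cold, sequential pass over a single file with fixed-size requests, counts a request as one miss if any block it touches is absent, and keeps every fetched block resident. These are simplifying assumptions for illustration; the function name is hypothetical.

```python
def sequential_request_miss_ratio(file_size, request_size, block_size):
    """Request miss ratio for one cold, sequential pass over a file.

    Assumes a request counts as one miss if any block it touches is
    absent, and that blocks fetched earlier in the pass stay cached.
    """
    fetched = set()
    requests = 0
    misses = 0
    for offset in range(0, file_size, request_size):
        end = min(offset + request_size, file_size)
        blocks = range(offset // block_size, (end - 1) // block_size + 1)
        requests += 1
        if any(b not in fetched for b in blocks):
            misses += 1
        fetched.update(blocks)
    return misses / requests if requests else 0.0
```

With 8-Kbyte requests over a 1-Mbyte file, this model gives a ratio of 1.0 for both 4-Kbyte and 8-Kbyte blocks, 0.5 for 16-Kbyte blocks, and 0.125 for 64-Kbyte blocks. Measured workloads mix request sizes, so the 4-Kbyte to 8-Kbyte step shows a small reduction there instead of none.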

The workloads that exhibit almost ideal sequential locality capture sequentially access the large files 
and do not reuse individual blocks; increasing the cache size does not capture more locality unless the entire 
file fits in the cache. A single large block suffices to capture the sequential locality of one active file. The 
number of blocks needed to capture sequential locality depends on the number of active files and how long 
each file's block remains active in the cache. Larger blocks stay active for a longer period of time because 
the workload takes longer to consume the data. If the cache cannot hold all the active files, the cache RMR 
looks like that of a cache with smaller blocks, because actively-used blocks get expelled. Some contention 
for cache space exists between files, and this becomes more pronounced for larger cache blocks. Thus, two 
half-size blocks perform better than a single large block. 

Workloads that exhibit little sequential locality even among very large files, such as the CPU Server, 
access files with very large requests. Most of the large files are executables, which get accessed all at 
once, rather than datafiles, which tend to be accessed in 8-Kbyte pieces. 

Temporal Cache Properties 

The temporal cache attempts to efficiently capture file reuse (temporal locality) and capture the working 
set in minimal cache space. Smaller cache blocks increase the usable cache space by reducing the amount 
of unused space per block, and by increasing the number of independent objects that can reside in the cache 
at one time. Excluding large files eliminates the low-reuse sequential data from the cache, which increases 
the density of actively used data and allows the cache area to more effectively capture highly reused data. 

Figure 19: Temporal cache miss request ratio for files smaller than 512 Kbytes. 

Moderate Files with Primarily Temporal Locality 

As evidenced by their cache behavior, most of the workloads contain primarily temporal locality once the 
large files have been excluded. The Application Data Analysis workload shown in Figure 19, like most 
of the workloads, exhibits only temporal locality behavior. For cache sizes smaller than the working set 
capture size, increasing the block size produces almost no reduction in the miss ratio. Increasing the block 
size proportionally increases the cache size required to capture the working set. Larger blocks do little to 
reduce the miss ratio for any cache size. 

Moderate Files with Both Temporal and Spatial Locality 

A few workloads, such as the MECCA Server (Fig. 19), exhibit both temporal and sequential locality among 
the moderate sized executables and datafiles. The two may not be easily separable. A large drop in the miss 
ratio occurs when the cache captures the temporal working set. The working set capture size is independent 
of block size, indicating the intertwined nature of its sequential and temporal locality. MECCA accesses a 
large set of medium-sized files sequentially. It reuses these files frequently, producing a large working set. 